In November 2016 I worked for Parkinson's UK to help them find out more about the people who used their services (membership, shops, library, support groups, forum, and donations). During the 5-week project, I downloaded their public Facebook Page and decided to analyse it after our project had ended (5 weeks was clearly not enough to do everything they had in mind).
This post is the first of three posts analysing that dataset.
At the start of our project, we discussed the possibility of using some Machine Learning techniques to identify which users had the Parkinson's condition. This would have allowed us to find different behavioural patterns between those with the Parkinson's condition, their carers and families, health care practitioners, and researchers. One of the main issues with using Machine Learning in this context is the requirement for a pre-classified set. I decided to classify the posts manually, and then to see whether there were different speech patterns between people with Parkinson's (pwp) and others (carers, health care practitioners, or researchers).
PART 1. In this first part, I analyse the differences in language between Parkinson's UK (the owner of the page, which I call PUK here) and their readers (PUKreaders). I found that they had similar centres of interest (diagnosis, medication, research, raising money and awareness) but different priorities (PUK focused on research, PUKreaders on medication). They also used different vocabulary: PUK used 'condition', while PUKreaders used 'disease'.
PART 2. Then, using a similar approach, I will analyse the differences in language between people with Parkinson's and the others.
PART 3. Finally, I will look at the successful posts (those that attracted the most comments, likes, and shares) to find the patterns of success (posting a video, telling a specific story, or discussing a treatment).
from __future__ import division

# Text processing
import json
import re
import string
import collections
from collections import Counter
import nltk
import nltk.collocations
import nltk.corpus
from nltk import bigrams, word_tokenize, FreqDist
from nltk.corpus import stopwords
from nltk.text import TokenSearcher

# Data handling and storage
import io
import sqlite3
import numpy as np
import pandas as pd

# Render Markdown output inside the notebook
from IPython.display import Markdown, display

def printmd(string):
    display(Markdown(string))

### BOKEH
from bokeh.charts import Bar, Scatter, output_file, show
from bokeh.charts.attributes import CatAttr
from bokeh.io import output_notebook, push_notebook
from bokeh.plotting import figure, ColumnDataSource
from bokeh.models import HoverTool
from bokeh.models.ranges import Range1d
from bokeh.layouts import gridplot
output_notebook()
0 - Clean the data
I started by removing the stopwords (common English words) using NLTK, the Natural Language Toolkit, and the punctuation using the string Python library. I also merged 'parkinson's uk' into a single token, to be able to separate it from 'parkinson's', and corrected some obvious misspellings that appeared in very common words.
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['https', 'http', 'org', u'“', u'’', u'–', 'www']

# Merge 'parkinson's uk' into one token (for both straight and curly apostrophes),
# normalise plurals, and fix common misspellings. Trailing spaces keep the
# replacements from gluing adjacent words together.
dic_replace = {u"parkinson\u2019s uk ": 'parkinsonsuk ', u"parkinson\u2019s": 'parkinsons',
               "parkinson's uk ": 'parkinsonsuk ', "parkinson's": 'parkinsons',
               'carers ': 'carer ', 'thank you ': 'thanks ', 'weeks ': 'week ',
               'treatmen ': 'treatment ', 'sympto ': 'symptoms ', 'symptom ': 'symptoms ',
               'dads ': 'dad ', 'mums ': 'mum ', 'years ': 'year '}
def tokenize(s):
    return nltk.word_tokenize(s)

def preprocess(s):
    """Lowercase, apply the replacements, tokenize, and keep alphabetic tokens only."""
    s = s.lower()
    for w in dic_replace:
        s = s.replace(w, dic_replace[w])
    tokens = nltk.word_tokenize(s)
    tokens = [token.lower() for token in tokens if token.isalpha()]
    return tokens

def lightclean(s):
    """Apply the replacements and strip punctuation, without tokenizing."""
    s = s.lower()
    for w in dic_replace:
        s = s.replace(w, dic_replace[w])
    for p in punctuation:
        s = s.replace(p, '')
    return s
def cleantext(fname, analysis_name):
    """Count the most frequent words and bigrams in a file of posts and export them."""
    error = 0
    count_stop = Counter()
    count_bigram = Counter()
    with open(fname, 'r') as f:
        for line in f:
            posts = json.loads('{}'.format(line))
            for post in posts:
                try:
                    terms_stop = [term for term in preprocess(post['content'])
                                  if term not in stop]
                    terms_bigram = bigrams(terms_stop)
                except:
                    error += 1
                    continue  # skip posts whose content cannot be processed
                count_stop.update(terms_stop)
                count_bigram.update(terms_bigram)
    nElements = 50
    with open('bigrams_' + analysis_name + '.txt', 'w') as f:
        f.write(str(count_bigram.most_common(nElements)))
    word_freq = count_stop.most_common(nElements)
    # Export the word frequency to json
    with io.open('wordfreq_' + analysis_name + '.json', 'w', encoding='utf-8') as f:
        f.write(unicode(json.dumps(word_freq, ensure_ascii=False, encoding='utf8')))
cleantext('posts.json','all')
cleantext('posts_puk.json','puk')
cleantext('posts_pukreaders.json','pukreaders')
1 - PUK vs PUKreaders
I divided my dataset into two groups: Parkinson's UK (PUK) and their readers (PUKreaders). I expected that PUK and PUKreaders would not only use different terminology but also have different centres of interest regarding the condition.
Let's first look at the number of posts written:
with io.open('posts_puk.json', encoding='utf-8') as f_puk, io.open('posts_pukreaders.json', encoding='utf-8') as f_pukreaders:
    posts_puk = json.loads(f_puk.read(), encoding='utf8')
    posts_pukreaders = json.loads(f_pukreaders.read(), encoding='utf8')

print "Parkinson's UK wrote", len(posts_puk), 'posts.'

authors = []
content_puk = []
content_pukreaders = []
for i in range(len(posts_pukreaders)):
    authors.append(posts_pukreaders[i]['person_hash_id'])
    content_pukreaders.append(lightclean(posts_pukreaders[i]['content']))
for i in range(len(posts_puk)):
    content_puk.append(lightclean(posts_puk[i]['content']))
authors = set(authors)

print 'Their readers wrote', len(posts_pukreaders), 'posts, written by', len(authors), 'authors; '\
      'which is', round(len(posts_pukreaders) / len(authors), 1), 'posts per author.'
1 - a. Word frequency
I prepared the data to make a bar chart of the 50 most common words in the PUK and PUKreaders posts. To do this, I used plain Python with a Counter() to count the number of times each word appeared in the posts, ordered the list, and took the first 50 words for each group.
with open('wordfreq_puk.json', 'r') as fpuk, open('wordfreq_pukreaders.json', 'r') as fpukreaders:
    words_puk = json.load(fpuk)
    words_pukreaders = json.load(fpukreaders)

freqwords_all = []
freqwords_puk = []
freqwords_pukreaders = []
for word in words_puk:
    freqwords_puk.append(word[0])
    freqwords_all.append(word[0])
for word in words_pukreaders:
    freqwords_pukreaders.append(word[0])
    freqwords_all.append(word[0])

# lists of the 50 most frequent words for PUK, for PUKreaders, and for their union
freqwords_all = list(set(freqwords_all))
freqwords_puk = list(set(freqwords_puk))
freqwords_pukreaders = list(set(freqwords_pukreaders))

# prepare data for the bar chart
df_puk = pd.DataFrame(words_puk, columns=['word', 'PUK'])
df_pukreaders = pd.DataFrame(words_pukreaders, columns=['word', 'PUKreaders'])
df = pd.merge(df_puk, df_pukreaders, on='word', how='outer')
df['diff'] = (df['PUK'] - df['PUKreaders']).fillna(0)
df = df.sort_values(['diff'])
df_plot = pd.melt(df, id_vars=['word'], value_vars=['PUK', 'PUKreaders'])
The following bar chart shows the 50 most used words in each group. It's a Bokeh graph, so you can interact with it: hover to read the underlying data, zoom, and pan. The JavaScript library is still in development, so it's a bit buggy in this Notebook (if you get lost in the zoom, just reload this page). Hopefully this will improve soon.
hover = HoverTool(
    tooltips=[
        ("value", "@y"),
    ]
)
bar = Bar(df_plot, label=CatAttr(columns=['word'], sort=False), values='value',
          tools=[hover, 'pan', 'wheel_zoom'],
          toolbar_location="above",
          stack='variable', title="Frequency of 50 most frequent words",
          width=600, height=300, legend='top_right', bar_width=0.7)
bar.xaxis.major_label_orientation = 20
bar.xaxis.major_label_text_font_size = '8pt'
bar.xaxis.axis_label = None
bar.yaxis.axis_label = None
show(bar, notebook_handle=True);
This bar chart is ordered by the difference between each word's frequency in the two groups' text: on the left, words more present in Parkinson's UK readers' posts; on the right, words more present in Parkinson's UK's posts. In the centre, words with only a green bar are frequent only in Parkinson's UK's posts, while words with only a pink bar are frequent only in the posts of Parkinson's UK's readers.
This shows four clusters:
- The words frequent only in PUK's text
- The words frequent only in PUKreaders' text
- The words frequent in both, more often in PUK
- The words frequent in both, more often in PUKreaders
Here we still consider all the text from each author type.
df_pukonly = df[df['PUKreaders'].isnull()]
df_pukreadersonly = df[df['PUK'].isnull()]
# frequent in both groups: neither column is missing
df_both = df.loc[~df['PUKreaders'].isnull() & ~df['PUK'].isnull()]
df_morepuk = df.loc[~df['PUKreaders'].isnull() & ~df['PUK'].isnull() & (df['PUK'] > df['PUKreaders'])]
df_morepukreaders = df.loc[~df['PUKreaders'].isnull() & ~df['PUK'].isnull() & (df['PUK'] < df['PUKreaders'])]
printmd('**Only frequent in PUKs posts**:')
print ', '.join(str(x) for x in df_pukonly['word'].values)
printmd('**Only frequent in PUKreaders posts**:')
print ', '.join(str(x) for x in df_pukreadersonly['word'].values)
printmd('**Frequent words in both**:')
print ', '.join(str(x) for x in df_both['word'].values)
printmd('**Frequent words in both, more frequent for PUK**:')
print ', '.join(str(x) for x in df_morepuk['word'].values)
printmd('**Frequent words in both, more frequent for PUKreaders**:')
print ', '.join(str(x) for x in df_morepukreaders['word'].values)
From this list, we can see that:
- Readers posting on Parkinson's UK's page express themselves differently: they use polite words such as 'hi', 'thanks', or 'please'.
- They refer to the Parkinson's condition as a 'disease' or 'pd' (for Parkinson's disease), while Parkinson's UK uses the word 'condition'.
- Parkinson's UK frequently uses both 'diagnosis' (noun, factual and general terminology) and 'diagnosed' (verb, emotional and passive), while readers frequently use only 'diagnosed', and even then it is not one of their most frequent words.
- Regarding indefinite pronouns, Parkinson's UK most frequently uses 'something' (along with the word 'things'), while readers of Parkinson's UK most frequently use 'anyone'. This suggests that Parkinson's UK talks more about objects than people, while their readers talk more often about people.
- Both use 'raise', 'support', and 'awareness' at similar rates, making raising awareness their common interest.
Frequent words per author, in both groups
This first analysis considered the entire text of each group, without distinguishing between authors (one person could have repeated the same word many times, making it artificially frequent). Therefore, instead of considering the entire text, I counted, for each frequent word in PUK's readers' posts, the number of distinct authors who mentioned it: each word is counted at most once per author.
I made a scatter plot showing the frequency of words across unique authors, with Parkinson's UK's frequent words on the x axis and their readers' on the y axis. All the words aligned on x=0 or y=0 are frequent in only one of the groups, although that does not mean the other group did not use them.
You can hover over the scatter plot to see the word each circle refers to, as well as zoom in.
with io.open('posts_puk.json', encoding='utf-8') as f:
    puk = json.load(f)
with io.open('posts_pukreaders.json', encoding='utf-8') as f:
    pukreaders = json.load(f)

comp = []
def count_s(s, dataset):
    """Count the posts (for PUK) or the distinct authors (for PUKreaders) mentioning s."""
    count = 0
    if dataset == puk:
        for i in range(len(dataset)):
            if s in dataset[i]['content']:
                count += 1
        comp.append(["Parkinson's UK", s, count, round(100 * count / len(dataset), 1)])
    else:
        authorsaidit = []
        for i in range(len(dataset)):
            if s in dataset[i]['content']:
                if dataset[i]['person_hash_id'] not in authorsaidit:
                    authorsaidit.append(dataset[i]['person_hash_id'])
                    count += 1
        comp.append(["Parkinson's UK readers", s, count, round(100 * count / len(dataset), 1)])
    return comp
for w in freqwords_puk:
    count_s(w, puk)
for w in freqwords_pukreaders:
    count_s(w, pukreaders)
df_comp_all = pd.DataFrame(comp, columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])

# reshape to one row per word, with one percentage column per author type
df_scatter = df_comp_all.copy()
df_scatter = df_scatter[['AuthorType', 'Word', 'Percentage']].set_index(['AuthorType', 'Word'],
                                                                        append=True)
df_scatter = df_scatter.unstack('AuthorType')
df_scatter = df_scatter.stack(0)
df_scatter = df_scatter.reset_index().drop(['level_0', 'level_2'], axis=1)
df_scatter['PUK'] = df_scatter.groupby(['Word'])["Parkinson's UK"].transform('sum')
df_scatter['PUKreaders'] = df_scatter.groupby(['Word'])["Parkinson's UK readers"].transform('sum')
df_scatter = df_scatter.drop_duplicates('Word')
df_scatter = df_scatter.fillna(0)
source = ColumnDataSource(
    data=dict(
        x=df_scatter["PUK"],
        y=df_scatter["PUKreaders"],
        desc=df_scatter["Word"],
    )
)
hover = HoverTool(
    tooltips=[
        ("word", "@desc"),
    ]
)
scatter = figure(plot_width=550, plot_height=400, tools=[hover, 'pan', 'wheel_zoom'],
                 toolbar_location="right")
scatter.circle('x', 'y', size=5, source=source)
scatter.xaxis.axis_label = "Parkinson's UK"
scatter.yaxis.axis_label = "Parkinson's UK readers"
scatter.title.text = "Most frequent words - separating authors"
show(scatter, notebook_handle=True);
I was struck by some words that seemed to come in pairs because they expressed a common concept. I selected five pairs and looked at them more closely.
- 'dad' vs 'mum'
- 'disease' vs 'condition'
- 'help' vs 'support'
- 'diagnosis' vs 'diagnosed'
- 'research' vs 'money'
comp = []
# pairs of words expressing a common concept (the sixth pair is not plotted)
listoflistwords = [['dad', 'mum'], ['disease', 'condition'], ['help', 'support'],
                   ['diagnosis', 'diagnosed'], ['research', 'money'], ['family', 'friends']]

def barplotdata(listoflistwords, plotnb):
    """Fill comp with the counts for the plotnb-th pair of words."""
    w2 = listoflistwords[plotnb - 1]
    for w in w2:
        count_s(w, puk)
        count_s(w, pukreaders)
    return comp
df_comp1 = pd.DataFrame(barplotdata(listoflistwords, 1), columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])
comp = []
df_comp2 = pd.DataFrame(barplotdata(listoflistwords, 2), columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])
comp = []
df_comp3 = pd.DataFrame(barplotdata(listoflistwords, 3), columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])
comp = []
df_comp4 = pd.DataFrame(barplotdata(listoflistwords, 4), columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])
comp = []
df_comp5 = pd.DataFrame(barplotdata(listoflistwords, 5), columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])
comp = []
# dummy dataframe: the sixth grid cell is only used to display the legend
df_comp6 = pd.DataFrame([["PUK", ' ', 0, 0.01], ["PUK readers", ' ', 0, 0.01],
                         ["PUK", ' ', 0, 0.01], ["PUK readers", ' ', 0, 0.01]],
                        columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])
bar1 = Bar(df_comp1, label='Word', values='Percentage', legend=None,
           width=150, height=200, group='AuthorType')
bar2 = Bar(df_comp2, label='Word', values='Percentage', legend=None,
           width=150, height=200, group='AuthorType')
bar3 = Bar(df_comp3, label='Word', values='Percentage', legend=None,
           width=150, height=200, group='AuthorType')
bar4 = Bar(df_comp4, label='Word', values='Percentage', legend=None,
           width=150, height=200, group='AuthorType')
bar5 = Bar(df_comp5, label='Word', values='Percentage', legend=None,
           width=150, height=200, group='AuthorType')
bar6 = Bar(df_comp6, values='Percentage',
           width=150, height=200, group='AuthorType')
barlist = [bar1, bar2, bar3, bar4, bar5, bar6]

# same y range on every plot, so the percentages are comparable
for b in barlist:
    b.y_range = Range1d(0, 30)
bar1.yaxis.axis_label = "Percentage of posts"
for b in barlist[1:5]:
    b.yaxis.axis_label = None
    b.xaxis.axis_label = None

# the sixth cell only shows the legend
bar6.axis.visible = False
bar6.ygrid.grid_line_color = None
bar6.outline_line_color = None
bar6.legend.spacing = 10
bar6.legend.padding = 0
bar6.legend.margin = 0
bar6.legend.border_line_color = 'white'

# make a grid
grid = gridplot([[bar1, bar2, bar3], [bar4, bar5, bar6]])
show(grid, notebook_handle=True);
Dad vs Mum
Both PUK and their readers speak more about dads than mums. The Parkinson's condition is usually diagnosed as people get older, and although there are more older women than older men, research shows that women are at lower risk of developing the Parkinson's condition (at least in the Western world; this apparently does not hold in Asian countries). The higher prevalence of Parkinson's among men could therefore explain why male parents are mentioned more often than female parents.
On the other hand, the imbalance is stronger for Parkinson's UK, who mention 'mum' 2.6 times less often than 'dad', than for their readers (1.46 times less often).
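For reference, here is a minimal sketch (not part of the original notebook) of how such a ratio can be read off the df_comp tables built above; the word_ratio helper is hypothetical.
# Hypothetical helper (not in the original notebook): ratio between two
# words' percentages for one author type, read from a df_comp table.
def word_ratio(df, author_type, w1, w2):
    sub = df[df['AuthorType'] == author_type].set_index('Word')
    return round(sub.loc[w1, 'Percentage'] / sub.loc[w2, 'Percentage'], 2)

print word_ratio(df_comp1, "Parkinson's UK", 'dad', 'mum')          # ~2.6
print word_ratio(df_comp1, "Parkinson's UK readers", 'dad', 'mum')  # ~1.46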
Condition vs Disease
Although the Parkinson's condition is commonly called Parkinson's disease, it is not a disease, as you cannot be cured of it, but a condition. Parkinson's UK is therefore careful to use 'condition' rather than 'disease', but the distinction has not yet reached their readers, who use the term 'condition' very infrequently.
Help vs Support
Parkinson's UK speaks more about help and support than their readers do, but the ratio between the two words is similar for Parkinson's UK (help/support = 1.44) and for their readers (1.57).
Diagnosed vs Diagnosis
Here the difference between the two groups is striking: Parkinson's UK mentions 'diagnosis' 1.8 times less often than 'diagnosed', while their readers mention it 5.4 times less often. For Parkinson's UK's readers, being diagnosed therefore matters much more than the diagnosis itself.
Money vs Research
Parkinson's UK talks much more frequently about research than money, while their readers talk more about money and comparatively little about research (although it is still a frequent word, and therefore an interest of theirs).
1 - b. Context
With a Natural Language Processing package such as NLTK, it is now possible to look into the context in which these words are used. I first cut each group's text into 'tokens', creating a corpus that NLTK can use to perform some common analyses, and wrote a function that finds unique expressions matching a pattern.
text_puk = ' | '.join(x.lower() for x in content_puk)
text_pukreaders = ' | '.join(x.lower() for x in content_pukreaders)
textnltk_puk = nltk.Text(word_tokenize(text_puk))
textnltk_pukreaders = nltk.Text(word_tokenize(text_pukreaders))

def find_unique_exp(text, exp):
    """Return the unique token sequences in text matching the TokenSearcher pattern exp."""
    uniqu = []
    match_tokens = TokenSearcher(text).findall(exp)
    for x in match_tokens:
        uniqu.append(' '.join(x))
    return ', '.join(str(x.encode('utf-8')) for x in list(set(uniqu)))
Collocations
I first looked at the collocations, that is, the words that often appear together. We take the entire corpus and find all the pairs of words that frequently appear side by side.
printmd("**Parkinson's UK**")
print textnltk_puk.collocations()
printmd("**Parkinson's UK readers**")
print textnltk_pukreaders.collocations()
First, we can see that in the Parkinson's UK corpus some of the words that appear together refer to specific events or people. Parkinson's UK writes long posts about them in which these names are repeated, so the pairs score highly even though the names are not actually that frequent across the corpus.
When we compare the two, we see that Parkinson's UK talks about raising both 'money' and 'awareness'. Unsurprisingly, their readers are more interested in raising money, although 'awareness week' also appeared frequently as a pair.
Finally, the word pairs were more positive in the Parkinson's UK corpus, e.g. 'big difference' or 'better treatments', while their readers also talked about 'passed away' or 'mental health'.
Unique expressions
PUK and PUKreaders mentioned dad and mum at different rates, though both groups mentioned 'dad' more often than 'mum'. The contexts in which the parents appear, however, could be quite different.
I first looked at the 'unique expressions' that included any variation of the female parent (mum, mums, mother, and mothers), and the same for the male parent (dad, dads, father, and fathers).
printmd('**Unique expressions of PUK for the female parent:**')
print find_unique_exp(textnltk_puk, r"<.*> <mum> | <.*> <mums> | <.*> <mother> | <.*> <mothers>")
printmd('**and for the male parent:**')
print find_unique_exp(textnltk_puk, r"<.*> <dad> | <.*> <dads> | <.*> <father> | <.*> <fathers>")
printmd('------------------')
printmd('**Unique expressions of PUK readers for the female parent:**')
print find_unique_exp(textnltk_pukreaders, r"<.*> <mum> | <.*> <mums> | <.*> <mother> | <.*> <mothers>")
printmd('**and for the male parent:**')
print find_unique_exp(textnltk_pukreaders, r"<.*> <dad> | <.*> <dads> | <.*> <father> | <.*> <fathers>")
Since PUK's posts feature news, stories, and interviews, the male and female parents appear there with varied pronouns: 'my', 'his', 'their'... This shows that the texts' authors have multiple relationships to the parent being discussed. On the other hand, authors who are not Parkinson's UK essentially talk about their own parent, highlighting that they share a similar relationship to the parents they mention.
Interestingly, the female parent, who was mentioned less often than the male parent, seems to be associated with multiple positive adjectives: best, lovely, proud, precious, amazing.
Concordance
To better understand the context in which words are used, it is also possible to look at the concordance: the occurrences of a word, in their context. The concordance centres the word in focus, highlighting the context in which it is used.
printmd("**Parkinson's UK**")
print textnltk_puk.concordance('mum')
printmd("**Parkinson's UK readers**")
print textnltk_pukreaders.concordance('mum')
printmd("**Parkinson's UK**")
print textnltk_puk.concordance('dad')
printmd("**Parkinson's UK readers**")
print textnltk_pukreaders.concordance('dad')
Here again, we see that for Parkinson's UK most discussions involving parents are positive and uplifting stories. Their readers, on the other hand, tell more ambivalent stories: they express positive feelings towards their parent, but sometimes tell sad stories.
Bigrams
I finally looked at bigrams: pairs of words in which one specific keyword appears. With this technique, it is possible to focus on one keyword and to quantify the words it appears with.
I wanted to explore the relationship that readers have with specific members of their family, so I looked at the words most associated with 'my'.
bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(textnltk_puk)
finder_pukreaders = nltk.collocations.BigramCollocationFinder.from_words(textnltk_pukreaders)
scored = finder.score_ngrams(bgm.likelihood_ratio)
scored_pukreaders = finder_pukreaders.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))
# Sort keyed bigrams by strongest association.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

# Same for the readers' corpus.
prefix_keys_pukreaders = collections.defaultdict(list)
for key, scores in scored_pukreaders:
    prefix_keys_pukreaders[key[0]].append((key[1], scores))
for key in prefix_keys_pukreaders:
    prefix_keys_pukreaders[key].sort(key=lambda x: -x[1])
printmd('**MY**')
bigram_my = prefix_keys['my'][:30]
bigram_my_pukreaders = prefix_keys_pukreaders['my'][:30]
df_bg_my = pd.DataFrame(bigram_my, columns=['word', 'PUK'])
df_bg_my_pukreaders = pd.DataFrame(bigram_my_pukreaders, columns=['word', 'PUKreaders'])
df_bg_both_my = pd.merge(df_bg_my_pukreaders, df_bg_my, on='word', how='outer')
df_bg_both_my['diff'] = df_bg_both_my['PUK'] - df_bg_both_my['PUKreaders']
df_bg_both_my = df_bg_both_my.sort_values('diff')
df_bg_both_my['word'] = df_bg_both_my['word'].str.encode('utf-8')
df_plot_bg = pd.melt(df_bg_both_my, id_vars=['word'], value_vars=['PUK', 'PUKreaders'])

bar = Bar(df_plot_bg, label=CatAttr(columns=['word'], sort=False), values='value',
          stack='variable', title="Bigrams for 'my'",
          width=550, height=300, legend='top_right', bar_width=0.7)
bar.xaxis.major_label_orientation = 20
bar.xaxis.major_label_text_font_size = '8pt'
show(bar, notebook_handle=True);
'Dad' is by far the word most associated with 'my' in the readers' corpus. Parkinson's UK, however, associates 'my' equally with 'dad' and 'wife', then with 'mum' and 'husband'. This suggests we could also look at the difference between husband and wife to find other trends.
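As a quick first look at that trend, here is a small sketch (not in the original notebook) reusing the count_s helper defined earlier; the comparison words are my own choice.
# Compare how often each group mentions 'husband' and 'wife' (a follow-up
# check that was not part of the original analysis).
comp = []
for w in ['husband', 'wife']:
    count_s(w, puk)
    count_s(w, pukreaders)
df_spouse = pd.DataFrame(comp, columns=['AuthorType', 'Word', 'NbpostsAuthors', 'Percentage'])
print df_spouse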
In the list of bigrams, 'new' caught my attention. I thought it would help find what expectations and hopes the authors have.
printmd('**NEW**')
bigram_new = prefix_keys['new'][:20]
bigram_new_pukreaders = prefix_keys_pukreaders['new'][:20]
df_bg_new = pd.DataFrame(bigram_new, columns=['word', 'PUK'])
df_bg_new_pukreaders = pd.DataFrame(bigram_new_pukreaders, columns=['word', 'PUKreaders'])
df_bg_both = pd.merge(df_bg_new_pukreaders, df_bg_new, on='word', how='outer')
df_bg_both['diff'] = df_bg_both['PUK'] - df_bg_both['PUKreaders']
df_bg_both = df_bg_both.sort_values('diff')
df_plot_bg_new = pd.melt(df_bg_both, id_vars=['word'], value_vars=['PUK', 'PUKreaders'])

bar_bg_new = Bar(df_plot_bg_new, label=CatAttr(columns=['word'], sort=False), values='value',
                 stack='variable', title="Bigrams for 'new'",
                 width=550, height=300, legend='top_right', bar_width=0.7)
bar_bg_new.xaxis.major_label_orientation = 20
bar_bg_new.xaxis.major_label_text_font_size = '8pt'
show(bar_bg_new, notebook_handle=True);
Parkinson's UK was more interested in new research, studies, and laws, while their readers talked more about new treatments or medication, as well as products or formulas. This suggests that Parkinson's UK focuses on the future, by finding new ways to improve the lives of people with Parkinson's, while their readers focus on the present and discuss new medication.
I was surprised to find 'new lawn' among PUK readers' bigrams, so I looked at the concordance and found that only one person had talked about a new lawn. In fact, in our text, bigrams with scores below 20 are bigrams that appear only once, so they can be misleading: they should not be read as frequent, merely as present.
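This is easy to verify: the collocation finder keeps the raw bigram counts in its ngram_fd frequency distribution. A minimal sketch of the check (not in the original notebook):
# Raw count of the bigram in the readers' corpus; FreqDist returns 0
# for unseen keys, and here we expect a count of 1.
print finder_pukreaders.ngram_fd[('new', 'lawn')]
# The concordance then shows the single post mentioning it.
textnltk_pukreaders.concordance('lawn')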
1 - c. Conclusion of Part 1
In this post I have used multiple Natural Language Processing tools:
- Most frequent words
- Collocations
- Unique expressions
- Concordance
- Bigrams
These tools helped analyse Facebook posts from two different groups: Parkinson's UK and their readers. The analysis highlighted their common and diverging interests and priorities, but also an interesting gender imbalance in how people are affected by Parkinson's.
Interests and priorities
Parkinson's UK and their readers shared similar interests: diagnosis, raising awareness, raising money, research, and medication. However, priorities differed slightly, with Parkinson's UK focusing on research and their readers on medication. Parkinson's UK also presented positive stories, while their readers talked positively about their parents but used more negative expressions. I did not conduct any sentiment analysis on this text, but it would probably find a similar pattern.
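If one wanted to test that intuition, here is a minimal sketch using NLTK's VADER sentiment analyser; this tool was not used in the original project, and the average compound score is just one rough way to compare the two groups.
# A rough comparison (not part of the original analysis): average VADER
# 'compound' score per group, using the content lists built earlier.
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
for name, texts in [('PUK', content_puk), ('PUKreaders', content_pukreaders)]:
    scores = [sia.polarity_scores(t)['compound'] for t in texts]
    print name, 'mean compound sentiment:', round(sum(scores) / len(scores), 3)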
Gender imbalance
Two main factors influence the gender of those who are talked about on Facebook: first, who develops Parkinson's; and second, who posts on Facebook. Recent research finds a higher prevalence of Parkinson's in males than females, possibly due to the role of hormones in the development of the condition; therefore more 'dads' might be talked about. On the other hand, as Parkinson's UK often stresses, those affected by Parkinson's are not just those who develop it, but also their whole families. This is supported by a reading of the text, which shows that when a person talks about their 'dad', they also mention their 'mum': in this case the dad is the person with Parkinson's, while the mum is the carer/family member. During the project, this was quickly flagged as a possible issue in applying Natural Language Processing tools. Finally, when comparing males and females of the same generation, we should remember that women might share more than men on Facebook; therefore more husbands are likely to be mentioned, regardless of Parkinson's prevalence in the population.